Given a large amount of real photos for training, convolutional neural networks show excellent performance on object recognition tasks. However, collecting such data is tedious and the available backgrounds are limited, which makes it hard to build a comprehensive database. In this paper, we present a generative model trained on synthetic images rendered from 3D models, which reduces the workload of data collection and the constraints on capture conditions. Our architecture consists of two sub-networks: a semantic foreground object reconstruction network based on Bayesian inference, and a classification network based on a multi-triplet cost function. The cost function avoids over-fitting on monotone surfaces and fully exploits pose information by arranging the descriptors of each category in a sphere-like distribution, which aids recognition on regular photos by leveraging the pose, lighting condition, background, and category information of the rendered images. First, we propose a conjugate structure, a generative model with metric learning, in which additional foreground object channels produced by Bayesian rendering serve as the joint between the two sub-networks. The pose-based multi-triplet cost function is used for metric learning, making it possible to train a category classifier purely on synthetic data. Second, we design a coordinated training strategy that uses adaptive noise to corrupt the input images, so that the two sub-networks benefit from each other and inharmonious parameter tuning caused by their different convergence speeds is avoided. Our structure achieves state-of-the-art accuracy of over 50\% on the ShapeNet database despite the data-migration obstacle from synthetic images to real photos. This pipeline makes it possible to perform recognition on real images using only 3D models.
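Since the abstract does not give the formula, the triplet margin loss underlying a multi-triplet cost can be sketched as follows; the function names, the margin value, and the averaging over multiple positive/negative pairs are illustrative assumptions, not the authors' exact formulation:

```python
import math

def l2_dist(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss: pull the positive sample (same
    category, similar pose) toward the anchor and push the negative
    sample away by at least `margin` in descriptor space."""
    return max(0.0, l2_dist(anchor, positive) - l2_dist(anchor, negative) + margin)

def multi_triplet_loss(anchor, positives, negatives, margin=0.2):
    """Hypothetical multi-triplet variant: average the triplet loss
    over every positive/negative pair formed with one anchor, so a
    single descriptor is constrained by several pose neighbors at once."""
    losses = [triplet_loss(anchor, p, n, margin)
              for p in positives for n in negatives]
    return sum(losses) / len(losses)
```

Minimizing such a loss over pose-labeled triplets is one way descriptors of a category could settle into the sphere-like arrangement the abstract describes.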